Accelerated Mini-batch Randomized Block Coordinate Descent Method
Tuo Zhao, Mo Yu, Yiming Wang, Raman Arora, Han Liu
We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the selected block can be obtained exactly.
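A minimal sketch of the kind of RBCD update the abstract describes, for a lasso-type problem whose $\ell_1$ regularizer is block separable. The problem instance, block partition, and step-size rule are illustrative assumptions, not the authors' method:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the l1 norm is block separable)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def rbcd_lasso(A, b, lam, n_blocks=4, iters=1000, seed=0):
    """Randomized block coordinate descent for 0.5*||Ax-b||^2 + lam*||x||_1.

    Each iteration updates one randomly chosen coordinate block using the
    exact partial gradient A[:, blk].T @ (A @ x - b) -- note this requires
    access to the full data matrix A, as the abstract points out.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    blocks = np.array_split(np.arange(d), n_blocks)
    # a safe step size: inverse of the largest blockwise Lipschitz constant
    step = 1.0 / max(np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks)
    residual = A @ x - b
    for _ in range(iters):
        blk = blocks[rng.integers(n_blocks)]
        g = A[:, blk].T @ residual          # partial gradient of the smooth part
        x_new = soft_threshold(x[blk] - step * g, step * lam)
        residual += A[:, blk] @ (x_new - x[blk])   # keep residual in sync
        x[blk] = x_new
    return x
```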
Reviews: The Sound of APALM Clapping: Faster Nonsmooth Nonconvex Optimization with Stochastic Asynchronous PALM
I find the approach rather interesting; in particular, the broad and general definition of the problem makes the approach applicable to a wide range of problems. However, I was surprised by the absence of any reference to the seminal Robbins/Monro paper and to the recent developments in stochastic-gradient-descent-based sampling (see below). The authors do local gradient descent updates of coordinate blocks by computing partial gradients and adding noise in each asynchronous step. I was wondering how this relates to the "usual" stochastic gradient descent update: given that the locally computed partial gradient is based on delayed (noisy) variable states, a sequence of these noisy partial gradients should converge to the true partial gradient as well. Further, recent SGD-based sampling has shown that adding noise to the variable states obtained by noisy gradient updates (as the authors do as well) provides good samples of the distribution underlying the optimal variable setting, even in a non-convex setting. That being said, the submitted work remains valid, but it would have been interesting to compare the proposed approach to well-established stochastic gradient methods.
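The SGD-based sampling the reviewer alludes to can be sketched as stochastic gradient Langevin dynamics: gradient steps plus Gaussian noise whose iterates approximate posterior samples rather than a point estimate. The target distribution and step size below are illustrative assumptions:

```python
import numpy as np

def sgld_sample(grad_log_post, x0, step=1e-3, n_steps=5000, seed=0):
    """Minimal stochastic gradient Langevin dynamics (SGLD) sketch.

    Each step ascends the log posterior and adds Gaussian noise scaled by
    sqrt(2*step); in the long run the iterates behave like samples from
    the posterior, even for non-convex targets.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x = x + step * grad_log_post(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
        samples.append(x.copy())
    return np.array(samples)
```

For a standard normal target (`grad_log_post = lambda x: -x`), the post-burn-in iterates have mean near 0 and variance near 1.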
Evolutionary Reinforcement Learning via Cooperative Coevolution
Hu, Chengpeng, Liu, Jialin, Yao, Xin
Recently, evolutionary reinforcement learning has attracted much attention in various domains. Maintaining a population of actors, evolutionary reinforcement learning utilises the collected experiences to improve the behaviour policy through efficient exploration. However, the poor scalability of genetic operators limits the efficiency of optimising high-dimensional neural networks. To address this issue, this paper proposes a novel cooperative coevolutionary reinforcement learning (CoERL) algorithm. Inspired by cooperative coevolution, CoERL periodically and adaptively decomposes the policy optimisation problem into multiple subproblems and evolves a population of neural networks for each of the subproblems. Instead of using genetic operators, CoERL directly searches for partial gradients to update the policy. Updating the policy with partial gradients maintains consistency between the behaviour spaces of parents and offspring across generations. The experiences collected by the population are then used to improve the entire policy, which enhances sampling efficiency. Experiments on six benchmark locomotion tasks demonstrate that CoERL outperforms seven state-of-the-art algorithms and baselines. An ablation study verifies the unique contribution of CoERL's core ingredients.
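A toy sketch of the decomposition idea described above: partition the policy parameters into subproblems and update each block with its partial gradient. The random partition, the generic `grad_fn`, and the single synchronous step are simplifying assumptions, not CoERL itself:

```python
import numpy as np

def decompose(n_params, n_subproblems, rng):
    """Randomly partition parameter indices into disjoint subproblems,
    in the spirit of cooperative coevolution."""
    idx = rng.permutation(n_params)
    return np.array_split(idx, n_subproblems)

def coevolve_step(theta, grad_fn, n_subproblems=4, step=0.1, seed=0):
    """One cooperative-coevolution-style update: each subproblem 'owns'
    a block of the shared policy parameters, moves only that block along
    its partial gradient, and the blocks are recombined into one policy."""
    rng = np.random.default_rng(seed)
    g = grad_fn(theta)                 # each subproblem reads only its slice
    new_theta = theta.copy()
    for blk in decompose(theta.size, n_subproblems, rng):
        new_theta[blk] = theta[blk] - step * g[blk]   # partial-gradient update
    return new_theta
```

On a toy quadratic objective (gradient `2*theta`), one step shrinks every block toward the optimum at the origin.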
Sequential Gradient Coding For Straggler Mitigation
Krishnan, M. Nikhil, Ebrahimi, MohammadReza, Khisti, Ashish
In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1),g(2),\ldots,g(J)\}$, where processing of each gradient $g(t)$ starts in round $t$ and finishes by round $(t+T)$. Here $T\geq 0$ denotes a delay parameter. In the GC scheme, coding is applied only across computing nodes, which results in a solution with $T=0$. On the other hand, having $T>0$ allows for designing schemes that exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished tasks and achieves improved straggler mitigation. In our second scheme, which constitutes our main contribution, we apply GC to a subset of the tasks and repetition for the remainder of the tasks. We then multiplex these two classes of tasks across workers and rounds in an adaptive manner, based on past straggler patterns. Using theoretical analysis, we demonstrate that our second scheme achieves a significant reduction in the computational load. In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers.
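The $T=0$ baseline discussed above can be illustrated with the fractional-repetition variant of gradient coding from Tandon et al.: workers are split into groups of size $s+1$ that replicate the same data shard, so any $s$ stragglers can be tolerated. The helper names below are hypothetical:

```python
def assign_tasks(n_workers, s):
    """Fractional-repetition assignment: split workers into groups of size
    s+1; every worker in a group computes the same partial-gradient sum
    over that group's data shard. Returns the worker -> shard map."""
    assert n_workers % (s + 1) == 0, "group size must divide worker count"
    return [w // (s + 1) for w in range(n_workers)]

def recover_gradient(shard_grads, worker_shard, alive):
    """Recover the full gradient (sum over shards) from any set of
    non-straggler workers that covers every shard at least once."""
    seen, total = set(), 0.0
    for w in alive:
        sh = worker_shard[w]
        if sh not in seen:              # take one response per shard
            seen.add(sh)
            total = total + shard_grads[sh]
    assert len(seen) == len(shard_grads), "too many stragglers"
    return total
```

With 6 workers and $s=1$ there are 3 shards; any single straggler per group still leaves one alive worker holding each shard's result.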
Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning
Maßny, Luis, Hofmeister, Christoph, Egger, Maximilian, Bitar, Rawad, Wachter-Zeh, Antonia
We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. In order to mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from only the non-straggling workers. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and can result in higher latency. We propose a gradient coding scheme that can tolerate a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. By proper task scheduling and small additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze the latency of our proposed scheme and show that it has a significantly lower latency than gradient codes.
Adaptive Stochastic Gradient Descent for Fast and Communication-Efficient Distributed Learning
Hanna, Serge Kas, Bitar, Rawad, Parag, Parimal, Dasari, Venkat, Rouayheb, Salim El
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays. One solution studied in the literature is to wait at each iteration for the responses of the fastest $k
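The fastest-$k$ idea in the abstract above can be sketched as a single aggregation round: the master waits only for the $k$ quickest workers and averages their gradients, so the round time is the $k$-th smallest delay. The delay model and function name are illustrative assumptions:

```python
import numpy as np

def fastest_k_round(worker_grads, delays, k):
    """One round of fastest-k aggregation in distributed SGD: ignore the
    n-k slowest workers (stragglers), average the k fastest gradients,
    and report the time at which the k-th response arrived."""
    order = np.argsort(delays)          # workers sorted by response time
    chosen = order[:k]
    round_time = delays[chosen[-1]]     # k-th fastest response ends the round
    avg_grad = np.mean([worker_grads[w] for w in chosen], axis=0)
    return avg_grad, round_time
```

Smaller $k$ shortens each round but averages fewer gradients; the paper's adaptive scheme varies $k$ over the course of training to trade these off.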
Dynamic Network-Assisted D2D-Aided Coded Distributed Learning
Zeulin, Nikita, Galinina, Olga, Himayat, Nageen, Andreev, Sergey, Heath, Robert W. Jr
Today, various machine learning (ML) applications offer continuous data processing and real-time data analytics at the edge of a wireless network. Distributed real-time ML solutions are highly sensitive to the so-called straggler effect caused by resource heterogeneity and alleviated by various computation offloading mechanisms that seriously challenge the communication efficiency, especially in large-scale scenarios. To decrease the communication overhead, we rely on device-to-device (D2D) connectivity that improves spectrum utilization and allows efficient data exchange between devices in proximity. In particular, we design a novel D2D-aided coded federated learning method (D2D-CFL) for efficient load balancing across devices. The proposed solution captures system dynamics, including data (time-dependent learning model, varied intensity of data arrivals), device (diverse computational resources and volume of training data), and deployment (varied locations and D2D graph connectivity). To minimize the number of communication rounds, we derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time. The resulting optimization problem provides suboptimal compression parameters, which reduce the total training time. Our proposed method is beneficial for real-time collaborative applications where the users continuously generate training data, resulting in model drift.
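Gradient compression of the kind whose rate the paper optimizes can be illustrated with generic top-$k$ sparsification; this is a stand-in operator chosen for concreteness, not the paper's derived optimal-rate scheme:

```python
import numpy as np

def topk_compress(grad, rate):
    """Keep only the largest-magnitude fraction `rate` of gradient entries
    and zero the rest, so only k values (plus indices) need transmitting.
    A generic compression operator for communication-efficient learning."""
    k = max(1, int(round(rate * grad.size)))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of top-k magnitudes
    out = np.zeros_like(grad)
    out[idx] = grad[idx]
    return out
```

A higher rate transmits more entries per round (faster convergence, more communication); the paper ties the choice of rate to processing and convergence time.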
Losses, Dissonances, and Distortions
In recent years, there has been a growing interest in using machine learning models for creative purposes. In most cases, this is with the use of large generative models which, as their name implies, can generate high-quality and realistic outputs in music [Huang et al., 2019], images [Esser et al., 2021], text [Brown et al., 2020], and other domains. The standard approach for artistic creation using these models is to take a pre-trained model (or set of models) and use it to produce output. The artist directs the model's generation by "navigating" the latent space [Castro, 2020], fine-tuning the trained parameters [Dinculescu et al., 2019], or using the model's output to steer another generative process [White, 2019, Castro, 2019]. At a high level, what all these approaches do is convert the numerical signal of a machine learning model's output into art, whether implicitly or explicitly. However, in most (if not all) cases, they only do so after the initial model has been trained.